Extracting Triangular 3D Models, Materials, and Lighting From Images
In this article, we'll explore a novel and efficient approach for joint optimization of topology, materials, and lighting from multi-view image observations.
3D content creation is a challenging, mostly manual task that requires both artistic modeling skills and technical knowledge. Efforts to automate 3D modeling can save substantial production costs or allow faster and more diverse content creation.
Currently, game studios often make use of photogrammetry techniques to build highly detailed virtual landscapes quickly. However, this is a complex multi-stage pipeline with many steps that have conflicting optimization goals, and errors propagate between stages, requiring further manual effort from artists to reach the desired quality in the final 3D model.
Is it really possible to frame this process as an inverse rendering task, and optimize as many steps as possible jointly, driven by the quality of the rendered images of the reconstructed model, compared to the captured input imagery?
This is the question that the authors of the paper Extracting Triangular 3D Models, Materials, and Lighting From Images attempt to answer. In this paper, the authors propose a highly efficient inverse rendering pipeline capable of extracting triangular meshes of unknown topology with spatially-varying materials and lighting from multi-view images, assuming that the object is illuminated under one unknown environment lighting condition and that we have the corresponding camera poses and masks indicating the object in these images.
The proposed approach learns topology and vertex positions for a surface mesh without requiring any initial guess for the 3D geometry. At the heart of this novel pipeline is a differentiable surface model based on a deformable tetrahedral mesh, which is extended to support spatially varying materials and high dynamic range (HDR) environment lighting through a novel differentiable split sum approximation.
In this panel, we can see the reconstruction of a triangular mesh with unknown topology, spatially-varying materials, and lighting from a set of multi-view images. The object in this scenario is reconstructed from real-world photographs and then used as a collider for a virtual cloth object. Note that in this case, the learned models not only receive scene illumination and cast accurate shadows but also robustly act as colliders for virtual objects.
This article was written as a Weights & Biases Report, which is a project management and collaboration tool for machine learning projects. Reports let you organize and embed visualizations, describe your findings, share updates with collaborators, and more. To learn more about Reports, check out Collaborative Reports.
💡
Here's what we'll be covering in this article:
Table of Contents
Busting the Jargon
Existing Approaches and Their Limitations
The Approach in Detail
Experiments and Results
Limitations of the Approach
Conclusion
Similar Reports
Busting the Jargon
- Photogrammetry is the art, science, and technology of obtaining reliable information about physical objects and the environment through recording, measuring, and interpreting photographic images and patterns of recorded radiant electromagnetic energy and other phenomena. For more information regarding the usage of this technique to capture 3D models, you can refer to the articles From images to 3D models and Photo tourism: exploring photo collections in 3D.
- Inverse Rendering refers to the estimation of intrinsic scene characteristics given a single photo or a set of photos of the scene. While predicting these characteristics from 2D projections is highly underconstrained, recent advances have made a big step towards solving this problem. For a quick primer on inverse rendering, you can check out this amazing video from ICCV 2015.
- Deformable Tetrahedral Meshes or DefTet is a particular parameterization of 3D shapes that utilizes volumetric tetrahedral meshes for the reconstruction problem. It optimizes for both vertex placement and occupancy and is differentiable with respect to standard 3D reconstruction loss functions. It is thus simultaneously high-precision, volumetric and amenable to learning-based neural architectures. For more information, refer to the paper Learning Deformable Tetrahedral Meshes for 3D Reconstruction.
- High Dynamic Range Rendering, or HDRR, is the real-time rendering and display of virtual environments in which lighting calculations are performed over a high dynamic range (used in gaming and entertainment technology). This allows the preservation of details that would otherwise be lost due to limited contrast ratios. For more information on HDRR, you can check out the corresponding Wikipedia article.
- A Signed Distance Field, or SDF, is a function that takes a position as input and outputs the distance from that position to the nearest surface of a shape, with a negative sign for positions inside the shape and a positive sign outside (a tiny numeric example follows this list). For more information on SDFs, you can refer to this article.
- BRDF, or Bidirectional Reflectance Distribution Function, is a function of four real variables that defines how light is reflected at an opaque surface. It is employed in the optics of real-world light, in computer graphics algorithms, and in computer vision algorithms. For more information on BRDF, you can check the corresponding Wikipedia article.
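To make the SDF definition concrete, here is a tiny, self-contained example (written purely for illustration, not taken from the paper): the signed distance to a sphere is negative inside the sphere, zero on its surface, and positive outside.

```python
import numpy as np

def sphere_sdf(p, center, radius):
    """Signed distance from point(s) p to a sphere: < 0 inside, 0 on the surface, > 0 outside."""
    return np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(center, dtype=float), axis=-1) - radius

# Unit sphere centered at the origin
print(sphere_sdf([1.0, 0.0, 0.0], [0, 0, 0], 1.0))  #  0.0 (on the surface)
print(sphere_sdf([0.5, 0.0, 0.0], [0, 0, 0], 1.0))  # -0.5 (inside)
print(sphere_sdf([2.0, 0.0, 0.0], [0, 0, 0], 1.0))  #  1.0 (outside)
```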
Existing Approaches and Their Limitations
Multi-view 3D Reconstruction
Classical Methods
- Classical methods for multi-view 3D reconstructions, such as Building Rome in a day and Pixelwise View Selection for Unstructured Multi-View Stereo, exploit inter-image correspondences to estimate depth maps. These methods typically fuse depth maps into point clouds, optionally generating meshes. They rely heavily on the quality of matching, and errors are hard to rectify during post-processing.
- Another set of classical methods, such as Poxels: Probabilistic Voxelized Volume Reconstruction, uses voxel grids to represent shapes. These methods estimate occupancy and color for each voxel and are often limited by the cubic memory requirement.
Neural Implicit Representations
These approaches leverage differentiable rendering to reconstruct 3D geometry with an appearance from image collections.
- Neural Radiance Fields and its follow-up works use volumetric representations and compute radiance by ray marching through a neurally encoded 5D light field. While achieving impressive results on novel view synthesis, geometric quality suffers from the ambiguity of volume rendering.
- Surface-based rendering methods such as Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision, and Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance use implicit differentiation to obtain gradients, optimizing the underlying surface directly. Unisurf is a hybrid approach that gradually reduces the sampling region, encouraging a volumetric representation to converge to a surface. NeuS provides an unbiased conversion from a signed distance field into density for volume rendering. The problem with all these methods is their reliance on ray marching for rendering, which is computationally expensive during training and inference.
- Volumetric and implicit shape representations such as SDFs and NeRFs can be converted to meshes through Marching Cubes in a post-processing step. However, Marching Cubes inevitably imposes discretization errors. As a result, the output mesh quality, particularly at the moderate triangle counts typically used in real-time rendering, is often not sufficient.
Explicit Surface Representations
These methods estimate an explicit 3D mesh from images, typically assuming a fixed mesh topology. DMTet, in contrast, directly optimizes the surface mesh, including its topology, using a differentiable marching tetrahedra layer. However, DMTet focuses on training with 3D supervision.
Lighting Estimation
- Older works on Bidirectional Texture Function (BTF) and Spatially varying BRDF (SVBRDF) estimation rely on special viewing configurations, lighting patterns, or complex capturing setups.
- More recent methods, such as MaterialGAN, use neural networks to predict BRDF from images. Differentiable rendering methods such as DIB-R++ and Appearance-Driven Automatic 3D Model Simplification learn to predict geometry, SVBRDF, and, in some cases, lighting via 2D image loss. Still, their shape is generally deformed from a sphere and cannot represent arbitrary topology.
- Neural implicit representations successfully estimate lighting and BRDFs from image collections. Neural Reflectance Fields for Appearance Acquisition and NeRV model light transport to support advanced lighting effects, e.g., shadows, but have high computational costs.
The Approach in Detail
Overview
- The authors present a method for 3D reconstruction supervised by multi-view images of an object illuminated under one unknown environment lighting condition, together with known camera poses and background segmentation masks.
- The target representation consists of triangle meshes, spatially-varying materials (stored in 2D textures), and lighting (a high dynamic range environment probe).
- The optimization task is carefully designed to explicitly render triangle meshes while robustly handling arbitrary topology.
- The authors adapt DMTet to work in the setting of 2D supervision and jointly optimize shape, materials, and lighting.
- At each optimization step, the shape representation, i.e., the parameters of a signed distance field (SDF) defined on a grid together with corresponding per-vertex offsets, is converted to a triangular surface mesh using a marching tetrahedra layer.
- Next, the extracted surface mesh is rendered with a differentiable rasterizer using deferred shading, and a loss is computed in image space between the rendered image and a reference image.
- Finally, the loss gradients are back-propagated to update the shape, textures, and lighting parameters.

An overview of the entire inverse rendering pipeline proposed by the authors of Extracting Triangular 3D Models, Materials, and Lighting From Images
Unlike most recent work using neural implicit surface or volumetric representations, in the proposed rendering pipeline the target shape representation is directly optimized.
💡
Problem Formulation
The optimization task of the proposed differentiable rendering pipeline is formulated to minimize the following empirical risk:

$$\arg\min_{\theta}\; \mathbb{E}_{c}\Big[\, L\big(I_{\theta}(c),\, I_{\mathrm{ref}}(c)\big) \,\Big]$$

where...
- $\theta$ denotes the optimization parameters (i.e., the SDF values and vertex offsets representing the shape, the spatially varying material, and the light probe parameters).
- $c$ is a given camera pose.
- $I_{\theta}(c)$ is the image produced by the differentiable renderer for the camera pose $c$.
- $I_{\mathrm{ref}}(c)$ is the reference image corresponding to the view from the same camera.
- $L$ is the loss function, given by $L = L_{\mathrm{image}} + L_{\mathrm{mask}} + \lambda\, L_{\mathrm{reg}}$, where
- $L_{\mathrm{image}}$ is an image-space loss, the $L_1$ norm on tone-mapped colors,
- $L_{\mathrm{mask}}$ is a mask loss, a squared ($L_2$) loss on object coverage, and
- $L_{\mathrm{reg}}$ is a regularization loss on the SDF values of DMTet that reduces floaters and internal geometry.
This empirical risk is optimized using the Adam optimizer based on gradients w.r.t. the optimization parameters $\theta$, which are obtained through differentiable rendering. The proposed renderer uses physically based shading and produces images with high dynamic range. Therefore, the objective function must be robust to the full range of floating-point values.
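To make the objective concrete, the following is a minimal PyTorch-style sketch of such a loss, assuming the renderer outputs linear HDR radiance and a coverage mask. The tone-mapping operator (log compression followed by an sRGB curve), the regularizer weight, and the function names are illustrative stand-ins rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def srgb(x):
    """Approximate linear-to-sRGB transfer curve."""
    return torch.where(x <= 0.0031308, 12.92 * x, 1.055 * x.clamp(min=1e-8) ** (1.0 / 2.4) - 0.055)

def tonemap(x):
    """Compress HDR radiance before comparing images (log transform followed by the sRGB curve)."""
    return srgb(torch.log(x.clamp(min=0.0) + 1.0))

def total_loss(rendered_rgb, ref_rgb, rendered_mask, ref_mask, sdf_reg, reg_weight=0.1):
    """L = L_image + L_mask + lambda * L_reg (weights and names are illustrative)."""
    l_image = F.l1_loss(tonemap(rendered_rgb), tonemap(ref_rgb))  # L1 norm on tone-mapped colors
    l_mask = F.mse_loss(rendered_mask, ref_mask)                  # squared loss on coverage masks
    return l_image + l_mask + reg_weight * sdf_reg

# Toy usage with random HDR images, masks, and a precomputed SDF regularization term
loss = total_loss(torch.rand(256, 256, 3) * 4.0, torch.rand(256, 256, 3) * 4.0,
                  torch.rand(256, 256), torch.rand(256, 256), sdf_reg=torch.tensor(0.02))
```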
Learning the Topology
- The proposed rendering pipeline explicitly renders triangle meshes during optimization in order to avoid discretization errors posed by Marching Cubes in a post-processing step.
- The authors leverage Deep Marching Tetrahedra or DMTet in a 2D supervision setting through differentiable rendering. DMTet is a hybrid 3D representation that represents a shape with a discrete SDF defined on vertices of a deformable tetrahedral grid.
- The SDF is converted to triangular mesh using a differentiable marching tetrahedra layer.
- The loss is computed on renderings of the 3D model and is back-propagated to the implicit field to update the surface topology. This allows us to optimize the surface mesh and rendered appearance end-to-end directly.
- Given a tetrahedral grid with vertex positions $v_i$, DMTet learns the SDF values $s_i$ and deformation vectors $\Delta v_i$ for each grid vertex.
- The SDF values and deformations can either be stored explicitly as values per grid vertex or implicitly by a neural network.
- At each optimization step, the SDF is first converted to a triangular surface mesh using Marching Tetrahedra, which is shown to be differentiable w.r.t. the SDF and can change surface topology in DMTet (a minimal sketch of the underlying interpolation appears below the figure).
- Next, the extracted mesh is rendered using a differentiable rasterizer to produce an output image. Image-space loss gradients are back-propagated to the SDF values and offsets (or network weights).
- A neural SDF representation can act as a smoothness prior, which can be beneficial in producing well-formed shapes. Directly optimizing per-vertex attributes, on the other hand, can capture higher frequency detail and is faster to train. In practice, the optimal choice of parametrization depends on the ambiguity of geometry in multi-view images.

A representation of Marching Tetrahedra extracting faces from a tetrahedral grid.
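The key differentiable operation inside Marching Tetrahedra is placing a surface vertex on every tetrahedron edge whose two endpoints have opposite SDF signs, by linearly interpolating the SDF values. Below is a minimal PyTorch sketch of that interpolation (assumed notation, not the authors' code), illustrating why gradients from an image-space loss can flow back to the SDF values and move or reshape the extracted surface:

```python
import torch

def edge_zero_crossing(v_a, v_b, s_a, s_b):
    """Place a surface vertex on the edge (v_a, v_b) where the SDF changes sign.
    The result is differentiable w.r.t. both the endpoint positions and the SDF values."""
    t = s_a / (s_a - s_b)        # interpolation weight at which the SDF crosses zero
    return v_a + t * (v_b - v_a)

# Two tet-grid vertices with opposite SDF signs (requires_grad so gradients reach the SDF)
v_a = torch.tensor([0.0, 0.0, 0.0])
v_b = torch.tensor([1.0, 0.0, 0.0])
s_a = torch.tensor(-0.25, requires_grad=True)
s_b = torch.tensor(0.75, requires_grad=True)

p = edge_zero_crossing(v_a, v_b, s_a, s_b)  # roughly [0.25, 0.0, 0.0]
p.sum().backward()                          # populates s_a.grad and s_b.grad
```

Because the vertex position is an explicit function of the SDF samples, an image-space loss on the rendered mesh can move the surface, and even change its topology as SDF signs flip during optimization.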
Shading Model
- The authors follow Appearance-Driven Automatic 3D Model Simplification for differentiable rendering and adopt the PBR material model from Practical Physically Based Shading in Film and Game Production, making it easy to import game assets and to render the optimized models directly in existing engines without modification.
- The proposed rendering pipeline optimizes mesh topology, which requires continually updating the parametrization and can introduce discontinuities into the training process. To robustly handle texturing during topology optimization, the authors use volumetric texturing, indexing into the texture by world-space position. This ensures that the mapping varies smoothly with vertex translations and changing topology.
- In order to manage the high memory footprint of volumetric textures, which grows cubically, the authors extend the approach proposed by PhySG, using a multilayer perceptron (MLP) to encode all material parameters in a compact representation. This representation can adaptively allocate detail near the 2D manifold representing the surface mesh, which is only a small subset of the dense 3D volume. Formally speaking, the authors let a positional encoding + MLP represent a mapping $x \mapsto (k_d, k_{orm}, n)$, i.e., given a world space position $x$, compute the base color $k_d$, the specular parameters $k_{orm}$ (roughness, metalness), and a tangent space normal perturbation $n$ (a minimal sketch of such a network is shown after this list).
- Once the topology and MLP texture representation have converged, the model is re-parametrized. The authors generate unique texture coordinates using xatlas and sample the MLP on the surface mesh to initialize 2D textures, then continue the optimization with fixed topology.
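Below is a minimal sketch of such a positional-encoding + MLP material network in PyTorch. The layer widths, number of encoding frequencies, and output packing are illustrative guesses; only the overall mapping $x \mapsto (k_d, k_{orm}, n)$ follows the description above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionalEncoding(nn.Module):
    """Sinusoidal encoding of 3D positions (NeRF-style)."""
    def __init__(self, num_freqs=6):
        super().__init__()
        self.freqs = 2.0 ** torch.arange(num_freqs) * math.pi

    def forward(self, x):                                # x: (..., 3)
        scaled = x[..., None, :] * self.freqs[:, None]   # (..., num_freqs, 3)
        enc = torch.cat([scaled.sin(), scaled.cos()], dim=-1)
        return torch.cat([x, enc.flatten(-2)], dim=-1)   # (..., 3 + 6 * num_freqs)

class MaterialMLP(nn.Module):
    """Maps a world-space position x to (k_d, k_orm, tangent-space normal perturbation)."""
    def __init__(self, hidden=128, num_freqs=6):
        super().__init__()
        self.encode = PositionalEncoding(num_freqs)
        self.net = nn.Sequential(
            nn.Linear(3 + 6 * num_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 9),  # 3 (base color) + 3 (k_orm channels) + 3 (normal)
        )

    def forward(self, x):
        out = self.net(self.encode(x))
        k_d = torch.sigmoid(out[..., 0:3])            # base color in [0, 1]
        k_orm = torch.sigmoid(out[..., 3:6])          # roughness/metalness packed as in the k_orm texture
        normal = F.normalize(out[..., 6:9], dim=-1)   # tangent-space normal perturbation
        return k_d, k_orm, normal

# Query the material at arbitrary world-space positions (e.g., rasterized surface points)
mlp = MaterialMLP()
k_d, k_orm, n = mlp(torch.rand(1024, 3))
```

Because the network is indexed by world-space position rather than by UV coordinates, the mapping stays smooth as vertices move and the topology changes, which is exactly the property described above.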
Image-Based Lighting
The authors adopt an image-based lighting model, where the scene environment light is given by a high-resolution cube map. Following the rendering equation, the outgoing radiance in the direction $\omega_o$ is given by

$$L(\omega_o) = \int_{\Omega} L_i(\omega_i)\, f(\omega_i, \omega_o)\, (\omega_i \cdot n)\, d\omega_i$$

This is an integral of the product of the incident radiance $L_i(\omega_i)$ from the direction $\omega_i$ and the BSDF $f(\omega_i, \omega_o)$. The integration domain is the hemisphere $\Omega$ around the surface intersection normal $n$.
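To make this integral concrete, here is a small Monte Carlo estimator of the equation above (purely illustrative; as discussed below, the paper deliberately avoids this kind of sampling in favor of the split sum approximation):

```python
import math
import torch

def sample_hemisphere(n, num_samples):
    """Uniformly sample directions on the hemisphere around the unit normal n."""
    v = torch.randn(num_samples, 3)
    v = v / v.norm(dim=-1, keepdim=True)
    v[(v * n).sum(-1) < 0] *= -1.0  # flip samples that fall below the surface
    return v

def outgoing_radiance_mc(w_o, n, brdf, incident_radiance, num_samples=4096):
    """Monte Carlo estimate of the rendering-equation integral over the hemisphere."""
    w_i = sample_hemisphere(n, num_samples)
    cos_theta = (w_i * n).sum(-1).clamp(min=0.0)
    integrand = incident_radiance(w_i) * brdf(w_i, w_o) * cos_theta[:, None]
    return integrand.mean(0) * 2.0 * math.pi  # divide by the uniform pdf 1 / (2*pi)

# Toy check: a white Lambertian surface under constant white lighting reflects roughly [1, 1, 1]
albedo = torch.tensor([1.0, 1.0, 1.0])
L = outgoing_radiance_mc(
    w_o=torch.tensor([0.0, 0.0, 1.0]),
    n=torch.tensor([0.0, 0.0, 1.0]),
    brdf=lambda w_i, w_o: albedo / math.pi,                  # Lambertian BRDF
    incident_radiance=lambda w_i: torch.ones(len(w_i), 3))   # uniform unit radiance
print(L)
```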
For the specular part of the outgoing radiance, the BSDF is a Cook-Torrance microfacet specular shading model given by

$$f(\omega_i, \omega_o) = \frac{D\, G\, F}{4\, (\omega_o \cdot n)\,(\omega_i \cdot n)}$$
where D, G, and F are functions representing the GGX normal distribution or NDF, geometric attenuation, and Fresnel term, respectively.
Cook-Torrance is a microfacet model, which means that it approximates surfaces as a collection of small individual faces called microfacets. One can think of these microfacets as polygonal approximations of the surface at a sub-pixel level. For more information on microfacet models in PBR pipelines, you can refer to this amazing article.
💡
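For reference, here is a small sketch of the D, G, and F terms and how they combine into the Cook-Torrance specular BSDF. The particular GGX roughness mapping and the Schlick-style G and F approximations below are common real-time conventions and not necessarily the paper's exact choices:

```python
import math
import torch

def ggx_ndf(n_dot_h, roughness):
    """GGX / Trowbridge-Reitz normal distribution function D (alpha = roughness^2)."""
    a2 = roughness ** 4
    denom = n_dot_h ** 2 * (a2 - 1.0) + 1.0
    return a2 / (math.pi * denom ** 2)

def smith_g(n_dot_v, n_dot_l, roughness):
    """Smith geometric attenuation term G (Schlick-GGX form)."""
    k = (roughness + 1.0) ** 2 / 8.0
    g1 = lambda x: x / (x * (1.0 - k) + k)
    return g1(n_dot_v) * g1(n_dot_l)

def fresnel_schlick(v_dot_h, f0):
    """Schlick approximation of the Fresnel term F."""
    return f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5

def cook_torrance_specular(n_dot_l, n_dot_v, n_dot_h, v_dot_h, roughness, f0):
    """f(w_i, w_o) = D * G * F / (4 (n . w_o)(n . w_i))."""
    d = ggx_ndf(n_dot_h, roughness)
    g = smith_g(n_dot_v, n_dot_l, roughness)
    f = fresnel_schlick(v_dot_h, f0)
    return d * g * f / (4.0 * n_dot_v * n_dot_l).clamp(min=1e-4)

# Example: a fairly rough dielectric (f0 = 0.04) viewed and lit head-on
print(cook_torrance_specular(
    n_dot_l=torch.tensor(1.0), n_dot_v=torch.tensor(1.0),
    n_dot_h=torch.tensor(1.0), v_dot_h=torch.tensor(1.0),
    roughness=torch.tensor(0.5), f0=torch.tensor(0.04)))
```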
The authors instead draw inspiration from real-time rendering, where the split sum approximation is an efficient method for all-frequency image-based lighting. Here, the lighting integral from the aforementioned rendering equation is approximated as

$$\int_{\Omega} L_i(\omega_i)\, f(\omega_i, \omega_o)\, (\omega_i \cdot n)\, d\omega_i \;\approx\; \int_{\Omega} f(\omega_i, \omega_o)\, (\omega_i \cdot n)\, d\omega_i \;\cdot\; \int_{\Omega} L_i(\omega_i)\, D(\omega_i, \omega_o)\, (\omega_i \cdot n)\, d\omega_i$$

where...
- The first term represents the integral of the specular BSDF with a solid white environment light. It depends only on the elevation angle between the normal and the view direction and on the roughness of the BSDF, and can therefore be precomputed and stored in a 2D lookup texture.
- The second term represents the integral of the incoming radiance with the specular NDF $D$. This term is also pre-integrated and represented by a filtered cubemap. In each mip level, the environment map is integrated against $D$ for a fixed roughness value.
The authors introduce a differentiable version of the split sum shading model to learn environment lighting from image observations through differentiable rendering.
💡
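Below is a minimal sketch of how the two pre-integrated terms are combined at shading time. Both lookups are stubbed out with random tensors here: in practice, the 2D table comes from pre-integrating the GGX BRDF against a white environment, the mip chain comes from filtering the learned HDR cube map, and the cube-map lookup along the reflection vector is collapsed to a single color per mip level for brevity.

```python
import torch

# Stand-ins for the two pre-integrated quantities (random data purely for illustration)
num_mips = 6
brdf_lut = torch.rand(64, 64, 2)                             # indexed by (n.v, roughness) -> (scale A, bias B)
prefiltered_env = [torch.rand(3) for _ in range(num_mips)]   # per-mip radiance "looked up" along the reflection vector

def split_sum_specular(n_dot_v, roughness, f0):
    """Specular shading = prefiltered_radiance * (F0 * A + B), following the split sum formulation."""
    # Term 1: the BRDF integrated against a solid white environment, stored in a 2D lookup table
    u = (n_dot_v.clamp(0, 1) * (brdf_lut.shape[0] - 1)).long()
    v = (roughness.clamp(0, 1) * (brdf_lut.shape[1] - 1)).long()
    scale, bias = brdf_lut[u, v, 0], brdf_lut[u, v, 1]

    # Term 2: the incoming radiance pre-integrated against the GGX NDF, one mip level per roughness
    mip = int(roughness.clamp(0, 1) * (num_mips - 1))
    radiance = prefiltered_env[mip]

    return radiance * (f0 * scale + bias)

print(split_sum_specular(torch.tensor(0.8), torch.tensor(0.3), f0=torch.tensor([0.04, 0.04, 0.04])))
```

Because both terms are plain texture lookups, the same formulation can be made differentiable w.r.t. the environment light, which is the key observation the authors exploit to learn lighting through 2D image losses.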
Experiments and Results
Training and Validation
The following panel demonstrates the process of training along with the reference model. We can observe that the pipeline gives us a result indicative of the final quality after just a few minutes.
Train and Validate
The aforementioned video was logged to a Weights & Biases Panel, which lets you use visualizations to explore your logged data, the relationships between hyperparameters and output metrics, and dataset examples. You can also embed your panels into your Reports!
💡
3D Model Extraction
The authors observe that the DMTet representation successfully learns challenging topology and materials jointly, even for highly specular models and when lit using high-frequency lighting. In the following panel, we demonstrate two examples from the Smithsonian 3D repository.
3D Model Extraction with known Lighting
Scene Editing and Simulation
The factorized scene representation in the proposed differentiable rendering pipeline enables more advanced scene editing compared to density-based neural representations.
Relighting Quality of Reconstructed Model
In the following table, we compare the relighting quality of the reconstructed model, rendered using the Blender Cycles path tracer, with the results of NeRFactor. The proposed approach produces more detailed results and outperforms NeRFactor in all metrics. Its artifacts come mainly from the mismatch between training (which uses a rasterizer) and inference (which uses full global illumination): in areas with strong shadowing or color bleeding, the material and geometry quality suffers.
Relighting Quality Comparison
Relighting quality for a scene from the NeRFactor dataset, with our examples relit using Blender, and NeRFactor results.
Note that the tables shown above were created using Weights & Biases Tables. Tables are a data visualization tool that lets you iterate on datasets and understand model predictions. To learn more about how to use Tables effectively, check out the official documentation.
💡
Deployment of Learned Representation
The representation learned by the proposed differentiable renderer can be directly deployed in the vast collection of 3D content generation tools available for triangle meshes. This greatly facilitates scene editing, which is still very challenging for neural volumetric representations.
Advanced scene editing example: the reconstructed models from the NeRFactor dataset are added to the Cornell box. Note that the learned models receive scene illumination and cast accurate shadows.
Advanced scene editing example: the reconstructed models from the NeRFactor dataset are used in a soft-body simulation. Note that in this case, the learned models not only receive scene illumination and cast accurate shadows but also robustly act as colliders for virtual objects.
Reconstruction from Photographs
Reconstruction from photographs (datasets from NeRD), comparing our results with NeRD and NeRF. The proposed renderer scores higher in terms of image metrics, most likely because the mesh representation enforces opaque geometry, whereas the competing algorithms rely on volumetric opacity. Despite inconsistencies in camera poses and masks, the results remain sharp while NeRF and NeRD suffer from floating or missing geometry.
Comparison between Spherical Gaussians and Split Sum Approximations
In the following table, we demonstrate environment lighting approximated with Spherical Gaussians using 128 lobes vs. Split Sum. The training set consists of 256 path-traced images with Monte Carlo sampled environment lighting using a high-resolution HDR probe.
Comparison between Spherical Gaussians and Split Sum
Results on the NeRF Synthetic Dataset
Decomposition results on the NeRF Synthetic dataset. The rendered models are demonstrated alongside the material textures: diffuse (k_d), roughness/metalness (k_orm), the normals, and the extracted lighting.
Limitations of the Approach
- The authors find that the main limitation of the proposed differentiable inverse rendering pipeline is the simplified shading model, which does not account for global illumination or shadows. This choice is intentional, to accelerate optimization, but it is a limiting factor for material extraction and relighting. However, given the current progress in differentiable path tracing, the authors hope that this limitation will be lifted in future work.
- The proposed approach relies on alpha masks to separate foreground from background. While the method seems quite robust to corrupted masks, it would be beneficial to further incorporate this step into the system.
- The proposed renderer uses a differentiable rasterizer with deferred shading, hence reflections, refractions, and translucency are not supported.
- During optimization, the proposed renderer only renders direct lighting without shadows.
Conclusion
- In this report, we discussed the paper Extracting Triangular 3D Models, Materials, and Lighting From Images which proposes a highly efficient inverse rendering pipeline capable of extracting triangular meshes of unknown topology with spatially-varying materials and lighting from multi-view images.
- We discussed existing approaches to this problem, including classical methods such as Building Rome in a day, explicit surface representations such as DMTet, and neural implicit representation approaches such as Neural Radiance Fields.
- We discussed an overview of the differentiable inverse rendering pipeline, including the problem formulation, topology learning, shading model, and image lighting.
- We demonstrated the performance of the proposed pipeline for 3D model extraction, scene manipulation, simulation scenarios, and reconstruction from photographs. We have seen that results are on par with state-of-the-art view synthesis and material factorization while directly optimizing an explicit representation: triangle meshes with materials and environment lighting.
- The representation learned by the proposed pipeline is, by design, directly compatible with modern 3D engines and modeling tools, which enables a vast array of applications and simplifies artist workflows.
- The authors note the limitations of the proposed approach and discuss the possibilities of future works addressing these issues.
- The authors note that, apart from deepfakes, a risk common to all scene reconstruction methods, there is little potential for nefarious use of the proposed method.
Similar Reports
Block-NeRF: Scalable Large Scene Neural View Synthesis
Representing large city-scale environments spanning multiple blocks using Neural Radiance Fields
Generating Digital Painting Lighting Effects via RGB-space Geometry
Exploring the paper "Generating Digital Painting Lighting Effects via RGB-space Geometry" in which the authors propose an image processing algorithm to generate digital painting lighting effects from a single image.
3D Image Inpainting With Weights & Biases
In this article, we take a look at a novel way to convert a single RGB-D image into a 3D image, using Weights & Biases to visualize our results.
Implementing NeRF in JAX
This article uses JAX to create a minimal implementation of 3D volumetric rendering of scenes represented by Neural Radiance Fields, using W&B to track all metrics.
EditGAN: High-Precision Semantic Image Editing
Robust and high-precision semantic image editing in real-time
PoE-GAN: Generating Images from Multi-Modal Inputs
PoE-GAN is a recent, fascinating paper where the authors generate images from multiple inputs like text, style, segmentation, and sketch. We dig into the architecture, the underlying math, and of course, generate some images along the way.